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Abstract In [9], a random thresholding method is introduced to select the signif- 
icant, or non null, mean terms among a collection of independent random variables, 
and applied to the problem of recovering the significant coefficients in non ordered 
model selection. We introduce a simple modification which removes the dependency 
of the proposed estimator on a window parameter while maintaining its asymptotic 
properties. A simulation study suggests that both procedures compare favorably to 
standard thresholding approaches, such as multiple testing or model-based clustering, 
in terms of the binary classification risk. An application of the method to the prob- 
lem of activation detection on functional magnetic resonance imaging (fMRI) data is 
discussed. 
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1 Introduction 



In [9], the following model is considered: 

Yi = m + Si, i = 1, . . . ,n, (1) 

where \n are unknown constants, some of which are zero, and Si are independent, 
identically distributed (iid) zero-mean random variables, with known cumulative dis- 
tribution function (cdf) F e . 

Within this model, the problem of selecting the significant coefficients jii ^ 0, 
based on the observations Yi is studied. Such a problem arises in many different 
application areas, such as genomics [ ], or neuroimaging [ ], to cite just a few. 

Many methods have been proposed to perform this task. Multiple testing proce- 
dures for instance (see [7] for an overview of existing methods), have been developed 
to control a certain type I error rate, such as the familywise error rate (FWER), or 
the false discovery rate (FDR) [ ] at a user-fixed level. It can be argued however that 
the choice of a level, which ultimately defines the subset of selected coefficients, is 
arbitrary, as their is no safe guideline to what an 'optimal' level of false detections 
should be. 

An alternative, that allows to control both type I and type II error rates, consists 
in fitting a mixture model to the data, with one class for the null (zero-mean) data, 
and one, or more, for the non-null data. A detection threshold can then be derived, 
which minimizes a certain classification risk, such as the binary risk, associated to 
the 0-1 loss function, resulting in a 'naive Bayes' classifier [ ]. The main difficulty 
of this approach lies in the choice of a distribution for the non-null data, which may 
influence significantly the resulting classifier. Many authors have proposed to deal with 
this issue through model selection techniques (see [3, 11, 5] for instance), however it 
remains an open-ended problem. 

In view of these difficulties, the random threshold (RT) approach introduced in [9] 
appears as a promising candidate, since it does not require the specification of a type-I 
error level, nor of a model for the non-zero mean observations. The principle of RT lies 
in estimating the number of significant coefficients, based on a random centering of 
the partial sums of the ordered observations. Because it relies on as little assumptions 
as possible, we expect RT to be more robust than the above-mentioned approaches. 

However, to date very little is known concerning the classification performances of 
RT procedures; [ ] essentially gives a minimal separation speed between null and non- 
null data for the method to attain perfect classification asymptotically. Furthermore, 
the algorithm described therein still depends on a window parameter, which may have 
some influence in presence of noisy data. 

This article describes a simple modification of the RT procedure, which removes its 
dependency on the window parameter, while maintaining its asymptotic properties. 
We then study the classification performances of both techniques using numerical 
experiments, in comparison to the above-mentioned standard approaches. 

The rest of this article is organized as follows. In Section 2, the original RT method 
is reviewed. The variable window extension is introduced in Section 3. In Section 4 
the results of numerical experiments are presented, which show the good properties 
of RT in terms of classification. An application to fMRI data analysis is discussed in 
Section 5, and we conclude in Section 6 by considering the perspectives open by this 
work. 
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2 Original random thresholding procedure 



2.1 Testing the presence of significant coefficients 

We start by recalling how the presence of non-zero means is tested, that is, how the 
null hypothesis Ho : \i% = is tested in [9]. This is done by comparing the cumulative 
sums of the ordered observations to their conditional expectations under Tio , according 
to the following steps: 

1. Order the observations |V(i)| > 1^(2)1 > • • • > 

2. For 2 = 1,..., n, let X (i) = - log(l - F H (\Y (i) |)) 

3. Let Tj = ELi and Q, = ^(r^Tn) 

4. Define the test statistic D n — maxj |Tj ; — Q 3 ;\/y/n. The null hypothesis is rejected 
if D n > d a , with d a such that ¥-u (D n > d a ) < a. 

Note that the cumulative sums are not computed directly from the ordered obser- 
vations, but from the transforms X(i), . . . , which, under Ho, are an ordered series 
of 5(1) random variables. The conditional expectations Qj = Kn (Tj\T n ) can then be 
computed using the following result (see [ ] for details) ; 

Theorem 1 (Conditional expectations of ordered exponential variables) Under 
7-to, the are an ordered series of 8(1) random variables. It follows that: 

i) E* (*w)=£?=a 

a) e Wo (t j ) = j +jEL j 7 

Hi) Eu (TATn) = g^§Tn. 

Furthermore, for n > 100, it is shown in [9] using Monte-Carlo simulations that: 

PH (£>n > 0.65) w 0.05, (2) 

which provides an approximate calibration for the above test. It is illustrated in 
Figure 1, on a dataset of n = 500 observations Y{ simulated according to model (1), 
with noise terms Si sampled from the A/"(0, 1) distribution. 

Under the global null Tio, that is, when jii = 0, (Figure 1, top), the cumulative 
sums (Tj) are seen to follow closely their conditional expectations (Qj). Consequently, 
the resulting test statistic value D n — 0.55, does not exceeds the critical value 0.65 
given by (2). 

In contrast, when we add m = 100 non-zero means, all taken equal to fii = 5., 
to the null data (Figure 1, bottom), the Tjs become substantially larger than their 
expected values Qj under Tio, resulting in a gap between the corresponding curves. 
Note that this gap is most significant around j = 100, because, for all j, Tj is the 
sum of the j largest observations (after transformation), containing mostly non-zero 
means for j < 100. Consequently, the ensuing test statistic D n = 23.44 is far above 
the critical value 0.65. 
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Figure 2: Random threshold procedure: partial cumulative sums (Tkj)j and 
their conditional null expectations (Qkj)j for & = (top), k = 100 (middle), 
and k = 200 (bottom), with window width K n = 200. 
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Figure 3: Random threshold procedure: the sequence rjk on a logarithmic scale; 
its minimum is attained for fc„ =99. 



2.2 Selecting the significant coefficients 

Upon rejection of the null hypothesis, the following task consists in selecting the signif- 
icant coefficients. The procedure for doing so can be interpreted in a data-dependent 
'multiple hypothesis testing' setting, as described hereafter. Consider the null hypoth- 
esis Ho, as defined in Section 2.1, and the set of alternative hypotheses: 



Hi,k ' for any i < fc, ji^ > 0, and /i(fc+i) = . . . = /i( n ) 



0. 



In other terms, 7ii,k corresponds to the hypothesis that the k largest observations 
only have non-zero means. Even though in real- life datasets null and non-null data 
are never perfectly separated, in general one cannot expect more than to discriminate 
between such hypotheses in non-ordered model selection. Note that this is equivalent 
to choosing a certain detection threshold to separate null from non-null data. 

Denote the expectation under 7ii,k (instead ofEn l k )- The RT procedure first 
computes the X^s using the same steps 1. and 2. as in Section 2.1, then adds the 
following steps: 

3. Let K n be some positive integer. For 1 < k < n — K n and 1 < j < K n , compute: 

k+j 
i=fc+l 

Qk,j — ^k(Tk,j\Tk,K n ) 



rjk 



max \T kJ - Qk,j\/Vn. 

l<j<K n 



4. Let k n argmin 1<fc<n _ K n k . 
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Figure 4: Random threshold method for a Gaussian null distribution with 
unknown variance a 2 : the sequence rjkftk) (top) and the sequence &k (bottom). 

As in the preceeding section, for each k X( k +i), . . . , X( k + n -k) is an ordered series of 
5(1) variables under ?ii,k, so Qkj can easily be computed using Theorem 1 (see [9] for 
further details). Heuristically, the partial cumulative sums (Tk,j) are compared to their 
conditional expected values Qk,j under the hypotheses 1~Li,k-> for k = 1, . . . , n — K n . 
The number of significant coefficients is estimated as the index k n corresponding to 
the minimal gap between Tkj and Qkj, as evaluated by rjk. 

This is illustrated in Figure 2, using a a dataset simulated exactly as in Section 2.1, 
that is, with m = 100 significant coefficients. Informally, the procedure uses a sliding 
window with width K n , and compares the cumulative sums (Tj ) within this window to 
their conditional expectations under the hypothesis that the window contains the K n 
largest null terms. For k = 1, the window contains in fact mainly significant terms, 
so that the T{js are well above their expected values, yielding a normalized gap of 
rji = 14.67. For k = 100, the window indeed contains mostly the K n largest null terms, 
so Tioo,j and Qiooj are of the same order, yielding a much smaller gap (77100 = 0.11). 
Finally, for k = 200, the window contains null terms, but not the K n largest, so the 
cumulative sums (Tkj) become lower than their expected values. Consequently, the 
gap increases (77200 = 0.71). Figure 3 shows the complete sequence of rjk values, with 
a clear minimum at k n — 99, close to the true number of significant coefficients. 

3 Extensions and asymptotic properties 
3.1 Unknown distribution extension 

The RT method recalled in the previous Sections may be difficult to apply to real-life 
problems, where the noise distribution F e is in general unknown. In [ ], an extension 
is proposed to the case where F e is a parametric distribution F e (- ; #), with unknown. 
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Quite naturally, this consists in estimating under each hypothesis 7ii,k from the null 
data V(fc+i) , Y(k+2) 3 • • • >^(n)j then using this estimate to derive the transforms X^), 
for z = fe + l,...,fe + i^ n - More precisely, having chosen some positive integer K n , the 
extension consists in performing for 1 < k < n — K n the following steps: 

1. let Ok = 0(Yk+i, • • • , Y n ) be an estimate of 

2. for i = l,...,n, let = - log (l - f] e |(|y (i) | ; 4)) , 

3. for 1 < j < K n , compute: 

• T k ,j(6k) = ES+l X (i)0k) 

• Qk,j(@k) = ^k(Tk,j(0k)\Tk,K n (0k)) 

4. Compute f] k (6 k ) = m&x 1 < j < Kn \T kJ (0 k ) ~ Qk,j(p k )\/y/n. 
Finally, the estimated number of components is given as before by 

k n argmin 1 < fc < n _^ry fc ((9 fc ). 

This simple extension is much more computation intensive than the original procedure, 
since the X^s for i = k + 1, . . . , k + K n must be re-computed for each k, instead 
of once and for all. It is illustrated in Figure 4, using the same simulated dataset 
as in the previous Sections. Here the null distribution is defined as the Gaussian 
A/"(0; cr 2 ), and the unknown variance a 2 is estimated by the usual mean squares: a k = 
5^r=fc+i ^(i)" ^ ne es tiniated number of significant coefficient, k n = 95 is still close 
to its true value, and so is the corresponding standard error estimate a| = 1.05. 

3.2 Varying window extension 

As we have seen previously, the RT procedure depends on a parameter K n which can 
be interpreted as a window width, since r/ k is a function of X(k+i)i • • • iX^ k+Kn y K n 
must be smaller than the number of null coefficients, but at the same time not too 
small, or r] k would become unstable. Hence choosing an appropriate value for K n may 
be a hindrance in practice, especially since we want the RT method to be adaptive 
and depend as little as possible on any form of tuning. 

This issue can be avoided by re-defining r] k as a function of X(k+i), • • , X( n y thus 
replacing the fixed width K n by a varying width n — k, which requires no prior tuning. 
We define the following procedure, starting with the same steps 1. and 2. as in 
Section 2.1, and adding the following steps: 

3. Let K n be a lower bound on the number of null coefficients. For 1 < k < 71 Kn 
and 1 < j < n — k, compute: 

T k,J = Y^i = k + 1 X (i) 

Qk,j — Efc(Tfc ; j |Tfc, n -fc) (3) 

r\ k = maxK^n-fc \T k j - Q k j\/Vn - k. 
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Figure 5: Random threshold with a varying window width: partial cumulative 
sums (Tkj)j and their conditional null expectations (Qkj)j for /c = (top), 
k = 100 (middle), and k = 200 (bottom). 



4. Let k n - argmin 1 < fc < n _ Kri ?7fc. 

In other terms, r/k would be strictly equal to the test statistic D n defined in Sec- 
tion 2.1, if the sequence (^(i))i<z<n where replaced by the subsequence (X^)k-\-i<i< n , 
i.e., the set of null terms under Tii^k- 

Notice that k n is independent from ft n , as long as 77 reaches its global minimum 
on {1, . . . }. The varying window extension presented here can of course be 

combined with the unknown distribution extension in Section 3.1. 

Figure 5 illustrates the varying window extension on the same simulated dataset 
as previously. Intuitively, the 'sliding' window is replaced by a 'shrinking' window, 
which at first encloses all observations, than progressively reduces in width as the 
largest observations are left out. Otherwise the same observations as in the fixed- 
width case hold: when k = 1, the cumulative sums (Tk,j)j are larger than their 
conditional expectations (Qk,j)j, resulting in a large gap (770 = 21.98). This gap is 
considerably reduced when k — 100 as the TJ(. jS and the Q'kjS become of the same 
order (77100 = 0.28), then the gap increases again for k — 200 as the cumulative sums 
become smaller than their expected values (77200 = 0.69). Though not shown here, the 
sequence 77^ attains its minimum in k n = 99, as in the fixed- window case. 

3.3 Asymptotic properties 

The estimator of the number of significant coefficients presented in Section 2.2 is 
consistent. This is the main result in [9], and it can be extended to the varying 
window setting. We start by recalling the following asymptotic framework: 

AF1 There exists t* £ (0,1) and a subset Ik* of {l,...,n}, with k n = [t*n] and 
\Ik*\ = k n , such that jii 7^ if i G Ik* - For all other index, fii — 0. 

AF2 For any i £ Ik* , \fi>i\ > OL n , for a certain sequence (a n ) with a n —> 00 (see [9] for 
further details). 

AF3 K, n /n —> c such that < c < 1 — £*. 

We then have the following result: 

Theorem 2 (Consistency of the random threshold) Let k n stand for the esti- 
mator defined in Section 2.2. Under assumptions AF1, AF2 ; AF3 ; and appropriate 
Von-Mises type conditions on the cdf F e of the errors (see [9] for further details), k n 
is consistent in the sense that: 

> U n ^j -> 0, (4) 

for any positive decreasing sequence (u n ) such that >Jnu n — >> 00. 

This result can be refined by deriving an upper bound, which we do not detail 
here, on the convergence rate of the probability in Equation (4), for a particular choice 
of sequence (u n ). Consistency also holds in the unknown distribution case, under a 
different set of assumptions which we do not recall here, and under the varying window 
extension, as shown in Appendix A. 
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This theorem is interesting in that it gives a convergence rate for k n , provided that 
a minimal signal-to-noise (SNR) ratio is attained, represented by a lower bound a n 
on the absolute values \fii\ of the non-zero means (assumption AF2). Note that, in 
order for the random threshold (or any other threshold for that matter) to asymptot- 
ically separate perfectly null from non-null data, this SNR must necessarily become 
arbitrarily large as the sample size increases. 

However, this theorem provides no clue to what happens when the SNR remains 
bounded, as we expect to be the case in real-life applications. In the remainder of this 
paper, our goal is to explore the behaviour of the RT approach in such cases. 

4 Simulation study 

In order to assess empirically the classification properties of RT, we designed several 
numerical experiments. Our goal was to compare the binary classification risk of the 
RT procedure (with both fixed and varying window width), to those obtained by 
model-based clustering and FDR control techniques. Specifically, we used a mixture- 
model, estimated via an expectation-maximisation (EM) algorithm [ ] to approximate 
the risk minimizing detection threshold, and the Benjamini-Hochberg (BH) procedure 
[ ] to derive a threshold controlling the FDR at a certain level. We dismissed FWER 
control techniques as they essentially yield constant thresholds at a given level, and 
are therefore of little interest when compared to adaptive approaches. 

We considered two cases, depending on whether the null distribution F e was con- 
sidered as known or not. Note that the BH procedure is based on the p- values 
Pi = 1 — F| e |(|>i|), hence it requires that F e be known, whereas this same distri- 
bution can be estimated using the EM algorithm. So in order to compare methods 
on a fair basis, we compared RT to the BH procedure when the null distribution was 
known, and to mixture model fit otherwise. 

4.1 Results with known null distribution 

We chose to simulate the Xi directly in the case of a known null distribution. Datasets 
of n — 10 000 observations where generated, containing each n\ = 1 000 significant 
terms. These were sampled from the Gamma distribution C}(a,f3), where /3 is a scale 
parameter, and the remaining 9 000 null terms were sampled from the £(1) distribution. 

Table 1 shows the average classification risks obtained by the different methods over 
100 simulated datasets and for different choices of the Gamma distribution parameters. 
More precisely, we chose to compute the ratio of each attained risk to the lowest 
achievable (oracle) risk, which makes more sense since perfect classification is in general 
unattainable. 

The binary risk and oracle threshold can be computed as follows. Consider a 
given dataset (Xi, Zi)i, where Zi is a binary variable, equal to if Xi is a null term 
(sampled from the 6(1) distribution), and 1 if Xi is a non-null term (sampled from 
the Gamma distribution). Then the overall classification error associated to a given 
detection threshold t is given by: 

c(t) = !{**>*}+ E 1 {^*<=t}, (5) 
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Table 1: Ratio of binary classification risks with respect to the lowest attainable 
(oracle) risk for FDR control at different levels (top) and for the RT procedure 
with fixed (K n = 5 000) and variable (n n = 5 000) window width (bottom), 
averaged over 100 simulated datasets. 

that is, the sum of type I (false detections) and type II (non detections) errors. The 
oracle threshold is then chosen to minimize this classification error: 

It can be seen that both RT approaches perform in general better than FDR control 
through the BH procedure, with a slight advantage to the varying window extension. 
Most importantly, the classification risks they attain is never more than 1.31 and 1.25 
times the oracle risk for the fixed and varying window version, respectively. In contrast 
to these near optimal performances, whatever the chosen level of FDR control, the BH 
procedure always performs poorly for at least one model, with an average classification 
risk that rises as high as 6.02 times the optimal one in the worst case. 

These results suggest that the RT approach, due to its adaptive nature, is indeed 
more stable than error rate control techniques, that depend on the choice of a false 
detection level, as we had anticipated. Moreover, the excellent performance of the RT 
methods, which attained near optimal risks on the studied datasets, is very encouraging 
for this approach. 

4.2 Results with unknown null distribution 

To illustrate the unknown distribution case, we simulated n — 1000 observations Yi, 
among which m = 100 where sampled from the A/"(/i, a 2 ) distribution with \i > and 
represented the significant terms, and n — m = 900 where sampled from the A/*(0, 1) 
distribution and represented the null terms. We used less observations than in the 
known distribution case because the methods used in the present case are much more 
computer- intensive. 

We implemented an EM algorithm to estimate a two-class Gaussian mixture model 
(GMM) from the data, with one zero-mean class to model the null data. As is often 
the case with iterative algorithms, providing initial values for the model parameters 
was the main problem we encountered. We found an efficient strategy for doing so, 
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Table 2: Top: Ratio of binary classification risks over oracle risk for model- 
based clustering (left) and for the RT procedure with fixed {K n = 5 00) and 
variable {n n = 5 00) window width (middle and right), averaged over 100 simu- 
lated datasets. Bottom: Results for bimodal non-null data (averaged over 100 
simulated datasets). 

taking advantage of the fact that the negative data contained mostly null terms, and 
could provide a good initial guess for the null distribution variance and mixture weight. 
Details of the algorithm are given in Appendix B. 

Table 2, top shows the average ratios of the classification risks obtained by the 
different methods with respect to the oracle risk, over 100 simulated datasets and for 
different choices of the Gaussian distribution parameters for the significant terms. 

All methods performed satisfyingly, yielding close to optimal risks, with model- 
based clustering performing slightly better than the RT methods. This comes as a 
little surprise, since in this case the former approach had several advantages: it was 
based on a parametric model for the significant terms that was precisely the one used 
to simulate the data, and it explicitly minimized the binary classification risk. In 
contrast, the RT approach requires no parametric assumption on the distribution of 
significant terms, and does not explicitly minimize a classification risk, but nevertheless 
gave good results. 

Furthermore, performances of the model-based method can deteriorate when it is 
based on the wrong assumptions. To illustrate this, we simulated n — 5 000 observa- 
tions Yi among which ri\ — 950 where sampled from the A/"(3, 1) distribution, rt2 — 50 
from the A/"(20, 1) distribution, and the remaining 4 000 from the A/"(0, 1) distribution 
and represented the null data. Consequently, the significant terms had a bimodal 
distribution. Most of these terms where next to the null mode, and a small number 
where next to a more distant mode. 

This way, we hoped to trick the mixture model, which assumed a unimodal dis- 
tribution for the significant terms, in detecting only the distant mode, while merging 
the other one with the null distribution. This is exactly what happened, as can be 
seen in Table 2, bottom: the mixture-model fit performs significantly worse than the 
RT approach in this case, the later maintaining a reasonable, though also degraded, 
classification risk. 

Of course it can be argued that such a dataset does not represent a realistic situ- 
ation; our point here is simply to illustrate the increased robustness of RT due to the 
fact that it requires no assumptions other than a noise model. It can also be discussed 
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that an alternative to the simplistic two-class GMM used here would be to allow a 
variable number of classes, combined with a model selection framework [8, 11, 5]. How- 
ever, implementing such complex strategies would be non-trivial, especially concerning 
the algorithm's initialization. This last issue could be addressed for instance by using 
stochastic extensions of the EM, such as the stochastic averaging EM (SAEM) [3], in 
order to reduce dependence on initial values. In contrast to such sophisticated strate- 
gies, the simplicity of the RT approach, which requires minimal implementation and 
virtually no tuning, appears as a key advantage in practice, especially in view of the 
good performances suggested by this study. 

5 Application to fMRI data analysis 

We now apply the random threshold approach to functional magnetic resonance imag- 
ing (fMRI) data analysis. fMRI is a modality of in vivo brain imaging that allows to 
measure the variations of cerebral blood oxygen levels induced by the neural activity of 
a subject lying inside a MRI scanner and submitted to a series of stimuli. A sequence 
of three-dimensional (3D) images of the brain is thus acquired, measuring over time 
a vascular effect of neural activity known as the blood oxygenation level dependent 
(BOLD) effect. From the time series recorded in each voxel, and the occurrence times 
for each stimulus, one may compute an estimate of the BOLD effect of the subject in 
response to any given stimulus, and more generally to any difference or combination 
of stimuli (contrast) [6, 16]. 

Thus, the fMRI data for one subject generally consists in a spatial map of ^-scores 
(Yi, . . . , Y p ), where p is the number of voxel in the search volume (which can be as 
high as 100 000), and Yi the estimated BOLD effect at voxel i. This map of measures 
of cerebral activity, also termed activation map, is plagued by several sources of un- 
certainty: the natural variability of brain activity, and the estimation noise induced 
by the MRI scanner. Thus, model (1) provides a good representation of the activation 
map (Yi, . . . , Y p ), with significant terms corresponding to brain regions involved in the 
task under study. 

In a typical fMRI study however, not one but several subjects are recruited from 
a population of interest, and scanned while submitted to the same series of stimuli. 
Activation maps associated with a given contrast are obtained for each subject, as 
described above, and used as input data for inference at the between- subject level, 
where the goal is to evidence a general brain activity pattern. Mass univariate, or 
voxel-based, detection [ ] is to date the most widely used approach to address this 
question. It starts with normalizing individual images onto a common brain template 
using nonrigid image registration. Next, a t-statistic is computed in each voxel to 
locally assess mean group effects. 

In both single-subject and multi-subject fMRI data analysis, the problem of acti- 
vation detection can be formulated statistically as that of detecting non-zero means 
among a collection of observations. The most common approach consists in threshold- 
ing a statistical map of brain activity [ ] . Multiple-testing techniques are widely used 
[12, 13], as well as mixture- models. The Gamma-Gaussian mixture model (GGM) 
is most often used in this context [ ]. It uses a Gaussian distribution for null, or 
inactivated, data, a Gamma distribution for activated data, and a negative Gamma 
distribution for deactivated data. 
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As these methods suffer from certain limitations, as discussed in the previous 
Sections, the RT appears as an appealing alternative in this context. Hence we decided 
to compare the regions detected by the different approaches, to see if RT succeeded in 
recovering regions known to be involved in certain well- studied cognitive tasks. 

5.1 Data acquisition and preprocessings 

We used a real fMRI dataset from the Localizer database [ ], involving a cohort of 38 
right-handed subjects, and acquired as follows. The participants were presented with 
a series of stimuli or were engaged in tasks such as passive viewing of horizontal or 
vertical checkerboards, left or right click after audio or video instruction, computation 
(subtraction) after video or audio instruction, sentence listening and reading. Events 
occurred randomly in time (mean inter stimulus interval: 3s), with ten occurrences 
per event type, and ten event types in total. 

Functional images were acquired on a General Electric Signa 1.5T scanner using an 
Echo Planar Imaging sequence . Each volume consisted of 34 64 x 64 3 mm-thick axial 
contiguous slices. A session comprised 130 scans. Anatomical Tl weighted images were 
acquired on the same scanner, with a spatial resolution of 1 x 1 x 1.2 mm 3 . Finally, 
the cognitive performance of the subjects was checked using a battery of syntactic and 
computation tasks. 

Single-subject analyses were conducted using SPM5 (http : / / www . f il . ion . ucl . ac . uk) . 
Data were submitted successively to motion correction, slice timing and normalization 
to the MNI template. For each subject, BOLD contrast images were obtained from a 
fixed-effect analysis on all sessions. Group analyses were restricted to the intersection 
of all subjects' whole-brain masks, comprising 43 367 voxels. 

We considered the t-score maps computed for different contrasts of experimental 
conditions. These were first converted to z-score maps, to obtain approximatively 
Gaussian statistics in inactivated voxels. Using these maps as input data, we then 
compared the detection thresholds obtained by Gamma-Gaussian mixture modeling 
(GGM), fixed-window random thresholding and the varying-window extension, also 
using the unknown variance extension in both cases (see Section 3.1). For simplicity, 
we only present here the results obtained for a fixed window equal to K n = 15 000. 

5.2 Individual subject activation map 

Our first illustration concerns the activation map of a single subject, for the "sentence- 
checkerboard" contrast. This contrast subtracts the effect of viewing horizontal and 
vertical checkerboards from that of reading video instructions, thus allowing to detect 
brain regions specifically implicated in the reading task. 

Figure 6, left, shows an axial slice from the z-score map before thresholding. Acti- 
vations are clearly seen in Wernicke's and Broca's areas (right and upper right), which 
are known to be involved in language processing (see [ ] , for instance) . The detection 
threshold found by GGM fit for the z-score map (2.03) is much lower than those found 
by the random threshold procedure, both with a varying window (3.19) and a fixed 
window (3.33). 

The random thresholds with fixed and variable windows yield very similar activa- 
tion maps in this case, which seem to capture the activated regions seen in the raw 
map. In contrast, the much lower threshold found by mixture modeling detects several 
smaller clusters, some of which may be false positives. 
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Figure 6: Axial slice from a z-score map for the "sentence-checkerboard" con- 
trast, using a temperature palette for the z-score values. From left to right: Un- 
thresholded, thresholded by GGM model fit, varying-window and fixed-window 
[K n = 15 000) random thresholding. Detected activations are shown against 
the subject's anatomical image. 

5.3 Group activation map 

In this second example, we consider a group activation map, specifically a map of t- 
statistics computed from the individual contrast maps of 15 subjects, thus enabling to 
infer regions of positive mean effects in the parent population. Our choice of limiting 
the number of subjects, rather than using the whole cohort, was driven by the fact 
that many fMRI studies are conducted on groups of less then 20 subjects. 

We report results for the "calculation-sentences" contrast, which subtracts acti- 
vations due to reading or hearing instructions from the overall activations detected 
during the mental calculation tasks. This contrast may thus reveal regions that are 
specifically involved in the processing of numbers. 

Figure 6, left, shows an axial slice from the activation map before thresholding, with 
clear activations in the bilateral anterior cingulate (upper middle), bilateral parietal 
(lower left and right) and right precentral (upper right) regions, all known to be 
involved in number processing [14]. 

Though sorted in the same order as previously, the varying window random thresh- 
old (2.49) is now roughly at equal distances from the threshold found by GGM mod- 
eling (1.79) and the fixed window random threshold (3.06). 

The three methods detected activations in the regions described above, though 
the fixed window random threshold seemed to miss some activations, and the GGM 
approach further detected smaller clusters, some of which may be false positives. 

Of course one cannot conclude from these examples only that RT is 'better' at de- 
tecting activations than GGM fit. However, the varying window extension successfully 
detected regions known to be involved in the two cognitive tasks considered here, while 
avoiding isolated peaks in other regions, which may be part of the background noise. 
These results suggest that the RT succeeded in capturing only the active regions, while 
the GGM approach seemed to detect spurious activations. 

6 Conclusion 

In this paper, we have introduced a simple modification to the random threshold (RT) 
procedure proposed in [9], to obtain an entirely unsupervised procedure for recovering 
non null mean terms from a collection of independent random observations, based 
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Figure 7: Axial slice from the group activation z-score map for the "calculation- 
sentence" contrast, using a temperature palette for the z-score values. From left 
to right: Unthresholded, thresholded by GGM model fit, varying- window and 
fixed- window random thresholding. Detected activations are shown against the 
mean anatomical image of all subjects. 

solely on a parametric model of the null terms. Our modification, which requires no 
prior tuning, conserves the consistency properties of the original procedure. 

We have implemented all the different versions of the random threshold method 
in a Python package. This was integrated to the Neuroimaging in Python (NIPY) 
open-source library, freely downloadable from http://nipy.sourceforge.net. 

We validated this approach through extensive numerical experiments, and showed 
that both the original procedure, based on a fixed window-width, and our extension, 
which uses a variable window, compare favorably to both multiple testing procedures, 
and model-based clustering, in terms of the binary classification risk, with a slight ad- 
vantage to our varying window extension. On the vast majority of simulated datasets, 
the risks achieved where close to the lowest achievable (oracle) risk, whereas each of 
the other approaches behaved poorly in at least one case. 

Thus RT appears as a very promising method for non-ordered model selection 
whenever no parametric assumptions are available concerning the data distribution. 
Such methods are needed in many application domains, as we have illustrated in the 
case of activation detection for fMRI data analysis. 

The good classification performances of RT evidenced empirically in our simu- 
lations suggest that a promising direction for future research would be to study its 
properties in the mixture-model setting, and especially its large-sample behaviour. An 
interesting question to answer would be whether the random threshold converges to a 
certain limit when the SNR remains constant, and if so, how does this limit compares 
to the oracle threshold. 
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A Proof of the Theorem 2 for the varying window 
extension 



Following [ ], we first recall some notations. Set Ui = Y% for i £ Ik* and v% — Y% for 
i £ Ik*] notice that (vi) is a sample from the distribution F e . Let (^(i))i<i<fc* and 
(v(i))i<i<n-k* be the sequences (\Ui\) and (\vi\) in decreasing order. Let Q n be the 
subset of Q where < a n /2 and U(k*) > o> n /2. 

A first lemma in [9] shows that P(Q n ) — > 1, i.e., the collections (u^) and (f(i)) 
are stochastically in order with high probability. The proof can then be restricted to 
Q n . 

Now, let Efc(Tfc 9 j) and Qkj be defined as in Equation (3). Using Proposition 1, we 
have: 



MTkj) = i(i+ Yl V*); 

i=j+l 

_ E fc (T fc ,,-) 

^fcU fc,n-fc J 
— -Bk,j,nTk,n — k' 

Also, let ai = Eo(^(i)) = S™ =i 1/^- Equation (4) can be shown separately for 
k > kn and k < fc* . Since the two cases are treated similarly, we will restrict ourselves 
here to the case k > fc*. On Q n : 



Tk,j Qk,j — Tk,j -Bk,j,nTk,n — k 

— (Tk,j — Efc* (Tfc,j)) — Bk,j,n (Tk,n-k — Efc* (Tk,n-k)) 
+Ek*(Tk,j) — Bk,j,>nMk*(Tk,n-k) 

— Rk,j H~ $k,j • 

Thus Tfcj — Qkj is decomposed into a random part Rkj and a deterministic part 
Skj- Over Q n , is a function of V(k), • • • , V( n _k*)- Before going further, we now 

recall the following result: 

Let Z(i) > . . . > Z( n ) be an ordered sequence of independent Exp(l) random 
variables. For 1 < j < n, let Tj = J2i=i^(i)- Introduce for t £ [0,1] the random 
process d n (t) = T[ nt ] — E(T[ nt ] |T n ). Then it is shown in [ ] that -^=d n (t), as a process 
indexed on t G [0, 1], converges in distribution to a certain zero mean Gaussian process 
A. 

To use this result, let k = [tn] and j = [sn] , for < t < 1 — c and < s < 1 — £* — t, 
for c in [AF3]. Then ^=%{T k j - Qfc,j)ln„ = 7=f( T [tn],[ S n] - Q[tn] , [an] )ln n , as a 
process indexed by (£, s) £ (0, l) 2 , converges in distribution to the zero-mean Gaussian 
process: 

similarly, - 7 == J Bjt jJjn E/ e * (Tfe >n _fc)ln n converges in distribution to another zero- 
mean Gaussian process, and so does their sum, —==R kj jln n . 
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On the other hand, 

Sk,j — { a i+j — a i + Bk,j,n(cLi+n-k ~ Q>i)), (6) 

2=1 

so that there exists a constant 7 > 0, which depends on c in [AF3], such that for 
all n > 1, kn < k < n — K n , we have sup 1<J<n _ fc > 7(/c — /c*). Finally we use 

the following inequality: 

> nu n ) < P(?7fc* > inf r] k ). 

k — k^ l >nu n 

From Equation (6), = 0, hence it follows that: 

yjn- k* T] k ^ sup \R k * j + S k ^ ,j \ 

1<j <n — /c* 
= SUp |i?fc*,j| 

< sup sup 

k~>k^ l<j<n — k 

On the other hand, 

Vn- k inf 77^ = inf sup \R k j + S k ,j\ 

k — k^ l >nu n fc-fc^TWn KjXn-fc 

> inf sup \Sk,j\- sup sup \R k j\, 

k-k^>nu n i<j< n - k k>k^ l<j<n-k 

so that we have: 

Wk*(kn-k*>nun) < P(C sup sup \R kJ \ > inf sup \S kJ \) +F(Q c n ) 

k>k^ l<j<n-k k-k^>nu n i<j< n _fc 

< P(C SUp SUp > 7™n) +P(On), 

where C is a constant which depends on c in [AF3] . This last probability vanishes as 
n goes to infinity, due to the weak convergence of R k jln n \D 

B Details of the EM algorithm for the two-class 
GMM with a zero-mean class 

We consider the following model: 

Yi\Zi=j *~ Af(fij,a]), i = l,...,n, j = 0, 1 

Zi ~ B(l,pi), (7) 

where /x = 0, and pj represents the proportion of data in class j, so that the vector 
of model parameters is: = (po, Mi, 0"o, 

Having initialized to 6^°\ the EM algorithm alternates the following steps: 
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E-step. Compute the conditional law of the indicator variable Zi at step t, that 
is, the Bernoulli defined by: 

..,y. „«) = fW<=i^T 

■= P$ (8) 



M-step. Update the estimate of model parameters by maximizing the condi- 
tional expectation of the complete log-likelihood, yielding: 

(t+1) = argmjucE^log/^IZ^ir,^], 

i 

the expectation being taken with respect to the conditional distribution of the 
indicator variables Zi computed in the previous step. This yields: 

(*) 

(t+i) _ Z-nPij . 

Pi 1 

J n 

„(t+i) _ l^iJPijJ^. 
ri - ^ (t) ' 

2(*+D E^OWf) 



(J, 



(9) 

Note that, throughout the iterations, /Xq = 0. 

Initialization. An initial guess for ctq is provided by the negative data, which 
consists mainly of null data: 

a 2(0) = tf^O}- 1 ^ 2 . 

Yi>0 

Then, we use a kernel estimate of the data density: 

a.) - i£-^. m 

i 

for a symmetric, positive kernel K, and a bandwidth h n . In pratice, we used the 
Gaussian kernel K{x) — 27T~ 1 ^ 2 e~ x and h n = y/n. 

By identifying the null mode of the data density kernel estimate to the null compo- 
nent of the mixture model, we then obtained an initial guess for the mixture weights: 
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Finally, the conditional law of the indicator variables where approached by: 
p (o) = _ 8 .[ ^expj-d?/^)}] 

These initial guesses are used to derive initial model parameter values via the 
M-step described above, for t = 0. 
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